are actually needed. Recent hardware-based stride prefetching techniques mostly rely on
processor pipeline information (e.g., the program counter and branch prediction table) for
prediction. Continuing developments in processor microarchitecture drastically change core 
pipeline design and require that existing hardware-based stride prefetching techniques be 
adapted to the evolving new processor architectures. ... 
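
The program-counter dependence mentioned above can be illustrated with a minimal sketch of a PC-indexed stride predictor. This is a generic textbook-style table, not the mechanism of the paper summarized here; the class names, the unbounded dictionary standing in for a hardware table, and the "same stride twice in a row" confidence rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    last_addr: int            # most recent address seen for this load PC
    stride: int = 0           # last observed address delta
    confident: bool = False   # True once the same nonzero stride repeats


class StridePrefetcher:
    """Toy PC-indexed stride predictor (hypothetical, for illustration only)."""

    def __init__(self):
        self.table = {}  # maps load PC -> Entry; real hardware would bound this

    def access(self, pc, addr):
        """Record a load and return an address to prefetch, or None."""
        entry = self.table.get(pc)
        if entry is None:
            # First time this load instruction is seen: just remember the address.
            self.table[pc] = Entry(last_addr=addr)
            return None
        stride = addr - entry.last_addr
        # Predict only when the new stride matches the previous one.
        entry.confident = stride != 0 and stride == entry.stride
        entry.stride = stride
        entry.last_addr = addr
        return addr + stride if entry.confident else None
```

Because the table is keyed by the load's program counter, a design like this is tied to information exported by the core pipeline, which is the coupling the abstract describes.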

Quantifying loop nest locality using SPEC'95 and the Perfect benchmarks
Kathryn S. McKinley, Olivier Temam 

November 1999 ACM Transactions on Computer Systems (TOCS), volume 17 issue 4 


This paper analyzes and quantifies the locality characteristics of numerical loop nests in
order to suggest future directions for architecture and software cache optimizations. Since 
most programs spend the majority of their time in nests, the vast majority of cache 
optimization techniques target loop nests. In contrast, the locality characteristics that drive 
these optimizations are usually collected across the entire application rather than the nest 
level. Indeed, researchers have studied nume ... 



8 Session 17: architecture: Sunder: a programmable hardware prefetch architecture for numerical loops
Tzi-cker Chiueh 

November 1994 Proceedings of the 1994 ACM/IEEE conference on Supercomputing 

Beyond data caching, data prefetching is by far the most effective way to address the 
memory access bottleneck associated with high-performance processors. This is particularly 
true for scientific programs whose working sets cannot be easily fit into the on-chip data 
cache. This paper proposes a new data prefetching architecture called Sunder, which 
combines the flexibility and accuracy of software prefetching with the transparency and
low overhead of hardware prefetching. Th ...

Despite large caches, main-memory access latencies still cause significant performance
losses in many applications. Numerous hardware and software prefetching schemes have 
been proposed to tolerate these latencies. Software prefetching typically provides better 
prefetch accuracy than hardware, but is limited by prefetch instruction overheads and the 

Prefetching, i.e., exploiting the overlap of processor computations with data accesses, is one
of several approaches for tolerating memory latencies. Prefetching can be either hardware- 
based or software-directed or a combination of both. Hardware-based prefetching, requiring 
some support unit connected to the cache, can dynamically handle prefetches at run-time 
without compiler intervention. Software-directed approaches rely on compiler technology to 

Vector processors have typically used vector registers, interleaved memory, and pipelined
access to data to provide sufficient memory system performance. Caches have been used 
mainly for instructions and scalar data, while vectors are usually uncached, presumably 
partially because of the belief that there is insufficient vector locality in these workloads. In 
this study we use memory address traces from an Ardent Titan to examine both reference 
locality and cache performance in a vector pro ... 

This paper focuses on extending the memory subsystem by integrating a prefetch buffer
mechanism. Prefetching allows high-level application knowledge to increase memory
performance, which currently constrains the performance of most systems. While
prefetching does not reduce the latency of memory accesses, it hides this latency by 
overlapping memory access and instruction execution. The first prefetch operation to the 
buffer is initiated by an explicit fetch instruction. All f ... 

latency by introducing slide-windowed floating-point registers with a data preloading feature
and pipelined memory. The architecture maintains upward compatibility with existing scalar
architectures. In the new architecture, software can control the window structure. This is an
advantage over our previous work on regi ...

Numerical applications frequently contain nested loop structures that process large arrays of
data. The execution of these loop structures often produces memory reference patterns that 
poorly utilize data caches. Limited associativity and cache capacity result in cache conflict 
misses. Also, non-unit stride access patterns can cause low utilization of cache lines. Data 
copying has been proposed and investigated in order to reduce cache conflict misses, but 
this technique has a high executio .... 
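
The remark about non-unit stride access and cache-line utilization can be made concrete with a small calculation. The sketch below is not taken from the paper; it assumes a long sweep over aligned, fixed-size elements that never straddle a line, and the 8-byte element and 64-byte line defaults are hypothetical values chosen for illustration.

```python
def line_utilization(stride_elems, elem_bytes=8, line_bytes=64):
    """Average fraction of each fetched cache line that a strided sweep uses.

    Assumes aligned elements no larger than a line and a sweep long enough
    that start-up effects can be ignored (illustrative model only).
    """
    if stride_elems < 1:
        raise ValueError("stride must be a positive number of elements")
    if elem_bytes > line_bytes:
        raise ValueError("element must fit within one cache line")
    stride_bytes = stride_elems * elem_bytes
    # Small strides touch one element every stride_bytes within a line;
    # strides of a line or more touch a single element per fetched line.
    return elem_bytes / min(stride_bytes, line_bytes)
```

Under these assumptions a unit-stride sweep uses every byte it fetches (1.0), a stride of two elements uses half (0.5), and any stride of eight or more elements uses only one eighth of each line (0.125).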

Today's commodity microprocessors require a low latency memory system to achieve high
sustained performance. The conventional high-performance memory system provides fast 
data access via a large secondary cache. But large secondary caches can be expensive, 
particularly in large-scale parallel systems with many processors (and thus many 
caches). We evaluate a memory system design that can be both cost-effective and
provide better performance, particularly for scientific workloads: a single ...

Prefetching is one approach to reducing the latency of memory operations in modern
computer systems. In this paper, we describe the Markov prefetcher. This prefetcher acts as
an interface between the on-chip and off-chip cache, and can be added to existing computer 
designs. The Markov prefetcher is distinguished by prefetching multiple reference predictions 
from the memory subsystem, and then prioritizing the delivery of those references to the 
processor. This design results in a pr ...
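
The idea of issuing several prioritized predictions per miss can be sketched as a first-order Markov model over the miss-address stream. This is a simplified illustration, not the hardware design the abstract describes: the class name, the `width` parameter, and the use of unbounded transition counts as the priority are assumptions.

```python
from collections import Counter, defaultdict


class MarkovPredictor:
    """Toy first-order Markov model of a miss stream (hypothetical sketch)."""

    def __init__(self, width=2):
        self.width = width                      # max predictions per miss
        self.successors = defaultdict(Counter)  # miss addr -> counts of next miss
        self.prev_miss = None

    def miss(self, addr):
        """Record a miss and return predicted next misses, most likely first."""
        if self.prev_miss is not None:
            # Learn the transition prev_miss -> addr.
            self.successors[self.prev_miss][addr] += 1
        self.prev_miss = addr
        # Rank the successors seen after this address by how often they occurred.
        ranked = self.successors[addr].most_common(self.width)
        return [a for a, _ in ranked]
```

Returning a ranked list rather than a single address mirrors the abstract's point that multiple predictions are fetched and their delivery is then prioritized.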